Detections, Bounds, and Timelines: UMass and TDT-3
Abstract
This report presents the system used by the University of Massachusetts for its participation in three of the five TDT tasks this year: detection, first story detection, and story link detection. For each task, we discuss the parameter-setting approach that we used and the results of our system on the test data. In addition, we use TDT evaluation approaches to show that the tracking performance that sites are achieving is what is expected from Information Retrieval technology. We further show that any first story detection system based on a tracking approach is unlikely to be sufficiently accurate for most purposes. Finally, we present an overview of an automatic timeline generation system that we developed using TDT data.

1. BASIC SYSTEM

The core of our TDT system uses a vector model for representing stories; that is, we represent each story as a vector in term space, where each coordinate is the frequency of a particular term in the story. The terms (or features) of each vector are single words, reduced to their root form by a dictionary-based stemmer. The system was originally developed for the 1999 summer workshop at Johns Hopkins University's Center for Language and Speech Processing.[1]

1.1. Detection algorithms

Our system supports two models of comparing a story to previously seen material: centroid comparison (agglomerative clustering) and nearest-neighbor comparison. A sketch of both strategies appears after the discussion of cosine similarity below.

Centroid

In this approach, we group the arriving documents into clusters. The clusters represent topics that were discussed in the news stream in the past. Each cluster is represented by a centroid, which is the average of the vector representations of the stories in that cluster. An incoming story is compared to the centroid of every cluster, and the closest cluster is selected. If the similarity of the story to the closest cluster exceeds a threshold, we declare the story old and adjust the cluster centroid. If the similarity does not exceed the threshold, we declare the story new and create a new singleton cluster with the story as its centroid.

k-nearest neighbor

The second approach, k-NN, does not attempt to model a topic explicitly; instead, it declares a story new if it is unlike any story seen before. Incoming stories are compared directly to all previously seen stories. The k most similar neighbors are found, and if the story's similarity to those neighbors exceeds a threshold, the story is declared old. Otherwise the story is declared new.

1.2. Similarity functions

One important issue in our approach is determining the right similarity function. We considered four functions: cosine, weighted sum, language models, and Kullback-Leibler divergence. The critical property of a similarity function is its ability to separate stories that discuss the same topic from stories that discuss different topics.

Cosine

The cosine similarity is a classic measure used in Information Retrieval and is consistent with a vector-space representation of stories. The measure is simply the inner product of two vectors, where each vector is normalized to unit length; it represents the cosine of the angle between the two vectors d1 and d2:

    cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)

(Note that if d1 and d2 have unit length, the denominator is 1.0 and the cosine is computed by a simple dot product.) Cosine similarity tends to perform best at full dimensionality, as in the case of comparing two long stories; performance degrades as one of the vectors becomes shorter. Because of the built-in length normalization, cosine similarity is less dependent on specific term weighting, and it performs well when raw word counts are used as weights.
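To make the two detection strategies of Section 1.1 concrete, here is a minimal Python sketch that drives both of them with the cosine measure just described, over sparse term-frequency vectors (Counter objects). It is an illustration under stated assumptions, not the system's actual code: the names (cosine, CentroidDetector, knn_is_new) and the single fixed threshold are inventions for this example, whereas the real system tunes its decision thresholds on training data.

    import math
    from collections import Counter

    def cosine(a, b):
        """Cosine of the angle between two sparse term-frequency vectors."""
        if len(a) > len(b):          # iterate over the shorter vector
            a, b = b, a
        dot = sum(w * b[t] for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    class CentroidDetector:
        """Single-pass clustering: a story is declared new when its best
        centroid similarity falls below the threshold, old otherwise."""

        def __init__(self, threshold):
            self.threshold = threshold
            self.clusters = []       # one summed term vector per cluster

        def process(self, story):
            """Return True if the story is new (starts a singleton cluster)."""
            # Cosine is scale-invariant, so comparing against the summed
            # vector is equivalent to comparing against the average (the
            # centroid proper).
            best, best_sim = None, -1.0
            for i, c in enumerate(self.clusters):
                sim = cosine(story, c)
                if sim > best_sim:
                    best, best_sim = i, sim
            if best is not None and best_sim >= self.threshold:
                self.clusters[best] += story       # old: adjust the centroid
                return False
            self.clusters.append(Counter(story))   # new: singleton cluster
            return True

    def knn_is_new(story, past_stories, k, threshold):
        """k-NN variant: average the k highest similarities to past
        stories; a below-threshold average means the story is new."""
        if not past_stories:
            return True
        sims = sorted((cosine(story, s) for s in past_stories), reverse=True)
        top = sims[:k]
        return sum(top) / len(top) < threshold

For example, detector = CentroidDetector(threshold=0.2) followed by detector.process(story_vector) for each arriving story yields the stream of new-topic decisions used in first story detection.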
Weighted sum

The weighted sum is an operator used in the InQuery retrieval engine developed at the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts. InQuery is a Bayesian inference engine with transition matrices restricted to constant-space deterministic operators (e.g., AND, OR, SUM). The weighted sum is a linear combination of evidence, with the weights representing the confidence associated with each piece of evidence:

    wsum(q, d) = Σ_i (q_i · d_i) / Σ_i q_i

where q represents the query vector and d represents the document vector. For instance, in the centroid model, cluster centroids serve as query vectors that are compared against incoming document vectors. Weighted sum tends to perform best at lower dimensionality of the query vector q; in fact, it was devised specifically to provide an advantage with the short user requests typical in IR, and its performance degrades slightly as q grows. In addition, weighted sum performs considerably better when combined with traditional tf·idf weighting (discussed below).

Language model

Language models furnish a probabilistic approach to computing the similarity between a document and a topic (as in centroid clustering) or between two documents (as in nearest neighbor). In this approach, previously seen documents (or clusters) represent models of word usage, and we estimate which model M (if any) is the most likely source that could have generated the newly arrived document d. Specifically, we estimate P(d|M) relative to P(d|GE), where P(d|GE) is estimated using the background model corresponding to word usage in General English. By making an assumption of term independence (a unigram model), we can rewrite

    P(d|M) = Π_{w ∈ d} P(w|M)

where the w are the individual tokens of d. We use a maximum likelihood estimator for P(w|M), which is simply the number of occurrences of w in M divided by the total number of tokens in M. Since our models may be sparse, some words in a given document may have zero probability under any given model M, resulting in P(d|M) = 0. To alleviate this problem we use a smoothed estimate

    P_s(w|M) = λ · P_ml(w|M) + (1 − λ) · P(w|GE)

which allocates a non-zero probability mass to the terms that do not occur in M. We set λ to the Witten-Bell[6] estimate λ = N_M / (N_M + V_M), where N_M is the total number of tokens in M and V_M is the number of distinct words in M.
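The weighted-sum and smoothed language-model scores can likewise be sketched in a few lines, under the same caveats as before: the function names (wsum, witten_bell_lambda, log_p_doc_given_model) and the small probability floor are assumptions of this illustration, and the InQuery engine itself is not reproduced. Models and the background are sparse term-count vectors (Counter objects).

    import math
    from collections import Counter

    def wsum(q, d):
        """InQuery-style weighted sum: a confidence-weighted linear
        combination of the document's evidence for each query term."""
        total = sum(q.values())
        if not total:
            return 0.0
        return sum(w * d.get(t, 0.0) for t, w in q.items()) / total

    def witten_bell_lambda(model):
        """λ = N / (N + V), with N tokens and V distinct words in M."""
        n, v = sum(model.values()), len(model)
        return n / (n + v) if n else 0.0

    def log_p_doc_given_model(doc_tokens, model, background):
        """log P(d|M) under the unigram model, with the maximum
        likelihood estimate smoothed against the General English
        background distribution."""
        lam = witten_bell_lambda(model)
        n_m = sum(model.values())
        n_ge = sum(background.values())
        logp = 0.0
        for w in doc_tokens:
            p_ml = model[w] / n_m if n_m else 0.0
            p_ge = background[w] / n_ge if n_ge else 0.0
            p = lam * p_ml + (1.0 - lam) * p_ge
            # A term unseen in both M and GE would still be zero; a
            # small floor keeps the log defined in this sketch.
            logp += math.log(max(p, 1e-12))
        return logp

Scoring each candidate model by log_p_doc_given_model(d, M, GE) minus log_p_doc_given_model(d, GE, GE) is then one way to realize the background normalization described above, since smoothing the background against itself returns P(w|GE) unchanged.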